
\chapter{Lexical Structure} \label{lexical structure}

\section{Notational Conventions}

In this and the subsequent chapters, subsets of the \frege{} grammar are given in running text. The complete grammar is the set of rules that is the union of all those subsets.

\par A
grammar rule defines a nonterminal symbol as a sequence composed of terminal symbols, nonterminal symbols or subrules indicating that the enclosed sequence \opt{is optional} or may occur \some{zero or more} times.
Nonterminal symbols and variable terminal symbols are written in \nont{italics}, constant terminals in bold \term{typewriter} font.
The definition of a nonterminal starts in the left margin and may
consist of alternative rules. Alternatives are separated by a
line break, some indent and a vertical bar.

\par In this section particularly, the lexical syntax of terminal symbols of the grammar will be defined by regular expression.
\index{expression!regular}
We use regular expressions as defined in the documentation of class
\texttt{java.util.regex.Pattern} in \cite{apidoc}. Regular expression
will appear in colored typewriter font like this
\regex{$\backslash{}$s?}.

In order to make things more readable, sometimes the name of a terminal symbol is used in a regular expression. The meaning is here to replace the terminal symbol with the regular expression defining it. For example:

\begin{flushleft}
\rul{digits}  \regex{$\backslash{}$d+}\\
\rul{float}   \term{digits}\regex{($\backslash{}$.}\term{digits}\regex{)?}\\
This is the same as:\\
\rul{float} \regex{$\backslash{}$d+($\backslash{}$.$\backslash{}$d+)?}
\end{flushleft}

Likewise, instead of \regex{foo|bar} we sometimes write \regex{foo}\oder{}\regex{bar}.

All regular expressions are to be understood to be anchored at the start of the yet unprocessed portion of the program text.

\par Some symbols like parentheses, separators and so on stand for themselves and are specified verbatim each time they occur in the grammar. To distinguish verbatim symbols like \sym{<- , ; ::} etc. from meta-syntactical symbols such as {\Large $|$} and \opt{..} and from regular expressions, we write them in a different colour.

\section{Unicode}

The \frege{} compiler will convert program text into a sequence of \href{https://en.wikipedia.org/wiki/Code_point}{Unicode code points}, according to the encoding of the input.  
All encodings that are supported by the \java{} platform are supported in \frege{}.
For detailed information consult the \java{}
API documentation \cite{apidoc}. \index{character!encoding}

The standard encoding is  \href{https://en.wikipedia.org/wiki/UTF-8}{UTF8}\footnote{
This is a diversion from the usual practice in the \java{} world, where the \java{} software assumes a so called \emph{platform specific default} encoding. Unfortunately, this promotes  encoding text files in character sets that are obsolete since decades, like \href{https://de.wikipedia.org/wiki/Codepage_850}{CP850} and in turn, this leads to countless problems. Not the least one being that people "know" that using "strange" characters will break everything, and so they simply don't them. 

Therefore, and in an attempt to avoid proprietary encodings, the \frege{} standard library uses UTF8 encoding everywhere by default.}, 
but can be changed for a compiler run through command lime options. It is, however, strongly recommended to configure text editors to produce UTF8-encoded \frege{} source files.

While it is possible to compose names from
valid Unicode letters, 
one should keep in mind that extensive use of
this feature will make the program text
difficult, 
if not impossible, 
to understand for members of different cultural background.
However, there is no objection against using Unicode symbols for self defined \hyperref[operator]{operators}.

Certain reserved words and operators can be abbreviated with Unicode symbols. The following table shows what is supported.

\begin{figure}[hbt] 
	\begin{tabular}{lll}
Unicode Name & Symbol    & Can be written for     \\
\hline
U+2200 FOR ALL & $\forall$ & \term{forall} keyword\\
U+2190 LEFTWARDS ARROW & $\leftarrow$ & left arrow \texttt{<-}\\
U+2192 RIGHTWARDS ARROW & $\rightarrow$ & right arrow \texttt{->}\\
U+21D2 RIGHTWARDS DOUBLE ARROW & $\Rightarrow$ & right double arrow \texttt{=>}\\
U+2237 PROPORTION & $::$ & double colon \texttt{::}\\
U+2026 HORIZONTAL ELLIPSIS &  … & range operator \texttt{..}\\
\hline\\
	\end{tabular}
	\caption{Supported Unicode replacements for Frege symbols}
	\label{unisymbole}
\end{figure}

%, double colon (∷ U+2237 PROPORTION), the arithmetic sequence operator (… U+2026 HORIZONTAL ELLIPSIS) and forall (∀ U+2200 FOR ALL). If those characters appear on their own and are not part of an operator name, they are taken as the special symbols <-, ->, =>, ::, .. or the keyword forall, respectively.

The symbols in \autoref{unisymbole} have their special meaning only if they stand on their own. For example,
$\leftarrow>$ is \emph{not} the same as \texttt{<->}.

\section{Lexical program structure}

\begin{flushleft}
\rul{program} \some{\nont{line}}\\
\rul{line} \some{\nont{whitespace} \nont{token}} \nont{whitespace}\\
\rul{whitespace} \regex{$\backslash{}$s*}\\
\rul{token} \nont{varid} \oder \nont{conid} \oder \nont{keyword} \oder \nont{qualifier} \oder \nont{parentheses} \oder \nont{specialsym}
\\\hspace{0.5in} \oder \nont{lexop} \oder \nont{literal}
\end{flushleft}

A program is made up of lines.
Source code is broken into lines before tokenization by appropriate
input reading functions that will recognize and strip line separator
characters typical for the underlying operating system. 

With the exception of \hyperref[doccomment]{documentation text} there is no token that would
extend over more than one line.

Each line contains zero or more tokens separated by whitespace. Still
more whitespace can occur before the first token or after the last
token.

Note that the definition of \nont{whitespace} allows for the empty
string of whitespace characters. Consequently, tokens may appear not to
be separated at all.

The possibility of zero length whitespace does not mean that whitespace
may be dismissed altogether. On the contrary, whenever two tokens appear
in sequence where a non empty prefix of the second token might also be a
valid suffix of the first one, nonempty white-space is required to allow
for unambiguous tokenization. In other words, the tokenization algorithm
will recognize the longest prefix of the remaining characters on the
current line that form a valid token.

Consider the following example:

\begin{code}
  a+2
  a2
  a 2
  a
    2
\end{code}

There are 3 tokens in the first line and one token in the second line.
Since a digit is a valid suffix of an identifier, a space must occur
between a and 2 to obtain two tokens, as shown on the third line.
Another possibility to separate the tokens would be to write them on
different lines, as shown in the last two lines.

\section{Comments} \index{comment}

Comments can appear everywhere whitespace can.

\begin{flushleft}
\rul{comment} \nont{linecomment} \oder{} \nont{blockcomment}\\
\rul{linecomment} \regex{---?.*} \gcom{line comment extends to end of line}\\
\rul{blockcomment} \regex{(?s)$\backslash$\{--?(.|}\nont{blockcomment}\regex{)*-$\backslash$\}}\\
\end{flushleft}

The character sequence making up a block comment may extend over multiple lines
\footnote{This is the only exception to the rule that no token crosses line
boundaries.}. 
Because block comments do nest, any occurrence of  \regex{-\}} or \regex{\{-} within the
commented text will interfere with the nesting.

\hasdiff{\\
A user defined operator (see \autoref{operator}) must not start with the comment introducing characters \texttt{\{-} or \texttt{--}.
}

\subsection{Documentation Text} \label{doccomment} \index{comment!documentation}

Block comments starting with \regex{\{--} and line comments starting with  \regex{---} are treated as \textit{documentation text}.
Unlike an ordinary comment, a documentation text is a token and hence is not only lexically but also syntactically relevant.

There are only certain places where documentation text may appear, as will be detailed in this section. 
In order not to complicate matters, subsequent sections will not mention documentation text anymore.

A documentation text may appear:
\begin{enumerate}
\item before the \term{module} keyword that starts a module. This will be the documentation for the module.
\item in place of a \hyperref[declarations]{top level declaration} or a declaration in the \term{where} clause of a data, class or instance declaration. It can also appear immediately before such a declaration.
This will be the documentation for the subsequent declared item. 
\item immediately either before or after a \nont{constructor} in a \hyperref[algdcl]{data declaration}. This will be the documentation for that constructor.
\item immediately before or after a constructor field. In the latter case, no comma must be written before the next constructor field. 
\end{enumerate}

For convenience, in cases 1 and 2 a sequence of documentation comments optionally separated by semicolon can be written.
The text of the documentation comments will be concatenated with an interleaving paragraph break.

\begin{code}
--- this module is documented
{-- second paragraph of module documentation -}
{-- third paragraph of module documentation-}
module D where

--- this is the list of Fibonacci numbers
fib = 1:1:zipWith (+) fib (tail fib)
--- documentation for type D
data D = 
      --- documentation for constructor C
      C { name :: String --- documentation for name, no comma here
          {-- documentation for age -}
          age :: Int }
    | N                  --- documentation for constructor N
\end{code}

Documentation text will be copied verbatim and it will be available in the binary results of compilation (e.g. \java{} class files), so that documentation processing tools can access it to generate documentation in various formats.


\section{Identifiers and Keywords} \index{identifier} \label {qualified names}

\begin{flushleft}

\rul{varid} \regex{($\backslash$p\{Ll\}$|$\_)($\backslash$d$|$\_$|$'$|\backslash$p\{L\})*}\\

\rul{conid} \regex{$\backslash$p\{Lu\}($\backslash$d$|$\_$|$'$|\backslash$p\{L\})*}\\


\rul{qualifier}
\regex{conid$\backslash$.}\\

\rul{qvarid} \nont{qualifier}  \nont{qualifier} \nont{varid} \oder{} \nont{qualifier} \nont{varid} \oder{} \nont{varid}\\
\rul{qconid} \nont{qualifier}  \nont{qualifier} \nont{conid} \oder{}  \nont{qualifier} \nont{conid} \oder{} \nont{conid}\\
\end{flushleft}

These rather complicated regular expressions deserve some further
explanation.

We distinguish lexically between two classes of identifiers.
Names for functions, values and local variables are \nont{varid}s
and start with a lowercase letter\footnote{Some scripts like Devanagari don't have a notion of upper- or lowercase. Hence, for our purposes we define a lowercase letter as a character that has the unicode attribute LETTER, but not UPPERCASE.} or an underscore.
Names that start with an uppercase letter (\nont{conid}s)
stand for value constructors,
type constructors, type classes, type aliases or name spaces.

Sometimes it is necessary to name an item that is defined in another
module or in the scope of a type or type class. Thus we need qualified
names, defined here as \nont{qvarid} and \nont{qconid}. They are formed
by writing one or two \nont{qualifier}s before a \nont{varid} or \nont{conid}.

A \nont{qualifier} consists of an identifier starting
with an uppercase letter and an immediately
following dot. The identifier may denote name spaces, types or type
classes. A \nont{qualifier} like \texttt{Foo.} is a single
token and thus may not contain spaces.

According to this, the syntax allows reference to items in the following ways:

\colorquote{hell}{
$N$.$T$.$v$ \hspace{0.3cm} $N$.$T$.$C$ \\
$T$.$v$  \hspace{0.3cm} $T$.$C$ \hspace{0.3cm}
$N$.$v$  \hspace{0.3cm} $N$.$C$ \hspace{0.3cm} $N$.$T$ \\
$v$ \hspace{0.3cm} $C$ \hspace{0.3cm} $T$
}

where $N$ would be a name space, $T$ a type or class name, $C$ a
data constructor name and $v$ the name of a function or pattern binding.

There are rare cases where it is possible to confuse the dots in the qualifiers with
the special operator \texttt{.} explained later, an example can be found  \hyperref[confusedots]{here}. 
Fortunately, such constructs can be disambiguated with spaces or parentheses.

\note{Unlike in other languages, \frege{} identifiers can contain apostrophes. 

A single underscore is not an identifier, but a symbol that has special meaning in patterns and expressions.}


\subsubsection{Name Resolution and Scope}

Names appearing in expressions and types are resolved by the following rules, where $N$, $T$ and $C$ stand for \nont{conid}s and $v$ for \nont{varid}s:

\begin{description}
\item [Names of the form $v$:] every enclosing lexical scope provided by a \term{let}, lambda expression or case alternative is searched in turn for the name.  If it is found, then it refers to an item defined in a \term{let} expression or a (part of a) pattern in a lambda expression or case alternative. Otherwise, it must be a globally visible item. If $v$ appears in the scope of a data definition, class definition or instance definition and there is a variable or function binding with the name $v$ then it is resolved to mean this binding except when this is an implementation of a type class operation which has a simple name in the current module. In that case, the name resolves to that class operation. Otherwise, it must be a global function or variable binding or a class member.
\item [Names of the form $T$ or $C$:] $T$ may appear in type signatures, where it denotes a type constructor, type name or class name, either an imported one or one that is declared in the current module. In expressions and patterns, $C$ denotes a value constructor.
\item[Names of the form $N$.$T$ or $N$.$C$:] $N$ must be a name space denoting an imported module, a data type or a class. $T$ must be a class name, type name or $C$ must be a data constructor from this name space. While it is possible, that a type and a data constructor have the same name this does not introduce ambiguities because syntactically either a type name $T$ or a data constructor $C$ can be meant, but not both.

It is also possible that a type name and a name space of an imported module have the same name. In this case, only the name space of the imported module is searched. If one needs to access $C$ in the name space of the type $N$.$N$ one needs to write a qualified name of the form $N$.$N$.$C$.
\item[Names of the form $N$.$v$ or $T$.$v$:]
$N$ must be name space denoting an imported module or $T$ must denote a data type, type alias or a class.
$v$ is the name of a function or pattern binding in $N$ or $T$. Again, if there is a name space $N$ and a type $T$ and $N = T$, then only $N$ is searched.
\item[Names of the form $N$.$T$.$C$ or $N$.$T$.$v$:] Fully qualified names denote a function or pattern binding or a data constructor belonging to type or class $T$ from name space $N$.
\end{description}

\subsubsection{Keywords}

Some character sequences that would otherwise be matched by rule \nont{varid} are keywords and will be recognized by the scanner as distinct terminal symbols.

\begin{flushleft}
\label{keyword} \index{keyword}
\regex{
 abstract case class data default derive deriving do
 else false forall foreign if import in
 infix infixl infixr
 instance interface let module native newtype of
 package private protected public
 then throws true type where
}
\end{flushleft}

The words \term{pure} and \term{mutable} are only recognized as keywords when immediately followed by the 
\term{native} keyword, otherwise they are regarded as ordinary \nont{varid}s.

\section{Operators} \label{operator} \index{operator!user defined}  \label{fixity} \index{declaration!top level!fixity declaration}

\frege{} supports user defined binary infix operators. In addition, ordinary functions or value constructors can syntactically act as infix operators. In fact, an \emph{operator} is nothing but a reference to a function with a possibly funny name. Such a function name has certain user definable syntactic properties, like associativity and precedence, which makes it possible to construct \hyperref[binex]{expressions in infix notation}.

An lexical operator (rule \nont{lexop}) is either

\begin{enumerate}
	\item a \nont{varid} or \nont{conid} enclosed in acute accent marks. Such operators can also be qualified, but only the last part is to be enclosed in acute accent marks.
	\item a sequence of characters that does not contain one of the following: comma, semicolon, 
	grave accent mark, apostrophe, double quote, acute accent mark,
	parentheses, braces, brackets, underscore, letters, digits or whitespace. Such characters are called \emph{operator characters}.
	
	Certain sequences of 1 or 2 operator characters form
	terminal symbols with special syntactic meaning
	(rule \nont{specialsym}). These symbols cannot be used as infix operators. But the characters they are made of \emph{can} be used in different operators.
\end{enumerate}

Operators may be introduced with a top level infix declaration (rule \nont{fixity}).

\begin{flushleft}
\rul{fixity} \nont{infix} \nont{precedence} \more{\nont{infixop}}
\\
\rul{infix} \term{infix} \oder{} \term{infixl} \oder{} \term{infixr}\\
\rul{precedence} \regex{[123456789]} \oder{} \regex{1[0123456]}\\

\rul{symop}  \regex{$\backslash{}$W+} \gcom{operator characters only}\\
\rul{infixop} \term{symop} \oder{} \term{varid} \oder{} \term{conid}\\
\rul{lexop}  \regex{\term{qualifier}\{0,2\}}\regex{(`}\term{infixop}\regex{`$|$}\term{symop}\regex{)}\\
\rul{specialsym} \regex{::} \oder{}
   \regex{->} \oder{}
   \regex{<-} \oder{} \regex{=>} \oder{}
   \regex{$\backslash{}|$} \oder{}
   \regex{=} \oder{} \regex{-} \oder{}
   \regex{!} \oder{} \regex{?} \oder{}
   \regex{,} \oder{} \regex{;} \oder{}
   \regex{$\backslash{}$.} \oder{}
   \regex{$\backslash{}\backslash{}$} \oder{}
   \regex{\_} \oder{}
   \regex{$\forall$}\\

\rul{parentheses} \regex{$\backslash{}$(} \oder{} \regex{$\backslash{}$)} \oder{} \regex{$\backslash{}$[} \oder{} \regex{$\backslash{}$]} \oder{} \regex{$\backslash{}$\{} \oder{} \regex{$\backslash{}$\}}\\

\rul{quotechar} \regex{["'´`]}\\
\end{flushleft}

An infix declaration causes interpretation of \hyperref[binex]{infix expressions} based on the operators associativity (left, right or none) and precedence.
Operators with higher precedence bind their operands before operators with lower precedence, so the precedence is to be taken as an indication of binding power.
Precedences range from 1 (weakest binding) to 16 (tightest binding).

In an expression containing operators, consecutive unparenthesized operators must have different precedences or they must both be either left or right associative to
avoid a syntax error.


See also the syntax of \emph{binary expressions} in \autoref{binex}, the example in \autoref{exprparse} and the table of predefined operators in \autoref{predefops}.

\subsection{Rules for using backquotes}

Every sequence of characters forming a valid operator symbol that is enclosed in backquotes will be recognized as an operator token. If the operator was not previously introduced through a fixity declaration it will be assumed that it is non-associative and has a precedence of 16.


For \nont{wordop}s it is necessary that one always explicitly indicates when one wants to \textit{use} them as operator. Thus, \nont{wordop}s must always be quoted with backquotes when they are in infix position. However, in the infix declaration all that matters is to announce the character sequence an operator is made of. Thus, backticks are not strictly needed when introducing word operators.

\begin{figure}

\begin{code}

infix 12 == !=           -- non associative
infixr 13 ++             -- right associative
infixl 14 div            -- left associative word operators, 
infixl 14 `mod`          -- backticks don't matter here
infixr 4 :               -- right associative
infixr 16 **             -- ditto, but binds tighter than `:`

a == b != c              -- syntax error, set parentheses explicitly
a ++ b ++ c  == d        -- (a ++ (b ++ c)) == d
a ** b ** c : d : e      -- (a ** (b ** c)) : (d : e)
a `mod` b                -- backticks make mod an operator
f div 2                  -- div is not used as operator here
                         -- but is an argument for function f
\end{code}

\caption{Parsing of expressions containing operators} \label{exprparse}
\end{figure}

\subsection{Imported operators} \label{importedops}

A module import (see also \autoref{import}) makes all fixities and precedences that were defined in the imported module available. Yet, depending on the import statement, certain functions can only be used with qualified names. It is possible to qualify operators:

\begin{flushleft}
\rul{qlexop} \nont{qualifier} \nont{lexop} \oder{} \nont{qualifier} \nont{qualifier} \nont{lexop}
\end{flushleft}

It is possible, but strongly discouraged, to assert different precedences and associativities for imported operators:

\begin{code}
import some.Module (***)
infix 7 ***
\end{code}

Compilers should emit warning messages in such cases to avoid inadvertently doing this.

\hasdiff{
\begin{itemize}
%item fixity is a lexical and syntactical property of certain operator symbols
\item fixity declarations are permitted at top level only
\item an operator whose fixity was not declared is taken to be non-associative and to have precedence 16 (tightest biding)
\item to use an operator \texttt{op} from name space \texttt{M} one writes \texttt{M.`op`}
\end{itemize}
}

\section{Unary operators}

There are two symbols that may be used as unary operators:

\begin{flushleft}
\rul{unop} \sym{!} \oder{} \sym{?}
%\rul{qunop} \nont{qualifier} \nont{unop} \oder{} \nont{unop}
\end{flushleft}

Unary operators can not be qualified. It is strongly discouraged to use them as names for own functions.

The unary operator \sym{!} is the boolean negation function; in patterns it has special meaning that signals  \hyperref[strictpats]{strict patterns}.

The unary operator \sym{?} is currently unassigned and reserved for future use. 

\section{Literals}

Literals are textual representations of values of certain simple types. All literals are valid expressions as well as \hyperref[patterns]{patterns}.

\begin{flushleft}
\rul{literal} \nont{boolliteral} \oder{} \nont{numericliteral} \alt{} \nont{charliteral} \oder{} \nont{stringliteral} \oder{} \nont{regexliteral}\\
\rul{numericliteral} \nont{integerliteral} \oder{} \nont{floatliteral}
\end{flushleft}

\hasdiff{Literal syntax is adopted from \java{}. Every literal determines a fixed type.}

\subsection{Boolean Literals} \label{boolliteral}

The boolean values are represented by the keywords \term{true} and \term{false}. 
Boolean values are of type \hyperref[boolean]{\texttt{Bool}}.

\begin{flushleft}
\rul{boolliteral} \term{true} \oder{} \term{false}
\end{flushleft}

\subsection{Numeric Literals}

The syntax of numeric literals follows closely that of \java{}, except that some exotic form of floating point literals are not supported.
In addition, there are literals for big integers.

Furthermore, for all numeric literals, the syntax of the integral part has been slightly extended: it is possible to separate trailing groups of 3 digits each with an underscore. This enhances legibility greatly with big numbers.

\paragraph*{Example}
The literal for the long integer value fifty-two billion four hundred and twenty-five million two hundred and fifty-four thousand five hundred and twenty-four can be written \texttt{52\_425\_254\_524L} or \texttt{52425254524L}.


\subsubsection{Integer Literals}

There are literals for values of three different integer types: \texttt{Int}, \texttt{Long} and \texttt{Integer}.

\begin{flushleft}
\rul{integerliteral} \nont{intliteral} \oder{} \nont{longliteral} \oder{} \nont{bigintliteral}\\
\rul{intliteral} as defined in \java{}, see section 3.10.1 in \cite{langspec3}\\
\rul{longliteral} as defined in \java{}, see section 3.10.1 in \cite{langspec3}\\
\rul{bigintliteral} \regex{$\backslash$d+(\_$\backslash$d$\backslash$d$\backslash$d)*[nN]} \\
\end{flushleft}

\frege{} adopts the syntax for integer literals from \java{}. An integer literal that would have type \texttt{int} in \java{} has type \texttt{Int} in \frege{}. An integer literal that would have type \texttt{long} in \java{} has type \texttt{Long} in \frege{}.

In addition, a sequence of decimal digits followed by one of the letters \texttt{n} or \texttt{N} (think \emph{natural} number) is a literal of type \texttt{Integer}, the data type of integral numbers of arbitrary size. Note that leading zeros do not indicate octal numbers as with the other integer literals.


\subsubsection{Floating-Point Literals}

There are literals for values of the \texttt{Float} and \texttt{Double} types. The syntax is a subset of that for \java{} floating point literals as specified in section 3.10.2 of \cite{langspec3}. Not supported are floating point literals that do not start with a digit and hexadecimal floating point literals.

\begin{flushleft}
\rul{floatliteral}
as defined in \java{}, see section 3.10.2 in \cite{langspec3}\\
\hspace{0.5in} except hexadecimal notation and literals\\
\hspace{0.5in} that start with a decimal point
\end{flushleft}

A literal that would have type \texttt{float} in \java{} has type \texttt{Float} in \frege{}. A literal that would have type \texttt{double} in \java{} has type \texttt{Double} in \frege{}.

\subsection{Character and String Literals}

\subsubsection{Character Literals} \label{charliteral}

Character literals are like \texttt{char} literals in \java{} and have type \texttt{Char}.

\begin{flushleft}
\rul{charliteral}
as defined in \java{}, see section 3.10.4 in \cite{langspec3}\\
\end{flushleft}

In addition, character literals that consist of a single Unicode escape sequence like \texttt{'$\backslash$u89ab'} are valid. There must be exactly 4 hexadecimal digits after the \term{u}.

\subsubsection{String Literals}

String literals are like \texttt{String} literals in \java{} and have type \texttt{String}. Unicode escapes in strings are valid.

\begin{flushleft}
\rul{stringliteral}
as defined in \java{}, see section 3.10.5 in \cite{langspec3}\\
\end{flushleft}

\note{\java{} programmers: Please observe that the string concatenation operator is \texttt{++} in \frege{}}.

\subsubsection{Literals for Regular Expressions} \label{regexliteral}

The \frege{} language supports regular expressions as a built in data type. Consequently it is possible to specify regular expressions literally. Such literals denote values of type \texttt{Regex} unless they are not well formed by the rules of the regular expression language. In the latter case, the compiler issues an error message and the program containing the ill formed literal does not compile.

\begin{flushleft}
\rul{regexliteral}  \regex{´($\backslash{}\backslash{}$´|[\symbol{94}$\backslash{}$´ ])*´}
\alt \regex{'($\backslash{}\backslash{}$'|[\symbol{94}$\backslash{}$'])*'}
\end{flushleft}

A regular expression literal is enclosed in grave accent marks and is a sequence of 0 or more characters that are not grave accent marks unless they are escaped with backslashes.

Alternatively, one may enclose the regular expression in apostrophes \emph{unless} it could be mistaken as a character literal. A regular expression that looks for a single character \term{X} can be written \term{'(?:X)'}.

The regular expression language is the one implemented in the \texttt{java.util.regex} package. It is documented along with the class \texttt{java.util.regex.Pattern} in \cite{apidoc}.
The only difference is that the grave accent mark (or the apostrophe, if that was used to enclose the literal) is a special character that signals the end of the regular expression.
If one wants to match a grave accent mark (or an apostrophe), one must write a backslash followed by a grave accent mark (or an apostrophe) in the regular expression.

Regular expression literals are compiled with the flags \texttt{CANON\_EQ}, \texttt{UNICODE\_CASE} and \texttt{UNICODE\_CHARACTER\_CLASS}. The latter two may be reset with embedded flag expressions \texttt{(?-u)} and \texttt{(?-U)}, respectively.

\note{A single backslash in a regex literal is the escape character for the regular expression language. Thus, for instance, the literal \regex{´$\backslash$ba´} means \emph{"the regular expression that matches the letter 'a' after a word boundary"} and not \emph{"... that matches the backspace character followed by the letter 'a'"}.
}

It is also possible to construct a string that contains a pattern and compile that to a pattern value. However, regular expression  literals are superior compared to string literals with pattern text
\begin{itemize}
\item because there is one level of backslash-interpretation less, thus one needs to write only half the number of backslashes
\item invalid regular expression literals are flagged at compile time, not when they are about to be used at runtime
\item regular expression literals will be replaced with references to read only pattern values that are built at program startup time. Thus one can safely use regular expression literals everywhere without performance penalty due to repeated pattern compilation. This has the added benefit that one can immediately see what the regular expression is and does not have to look it up somewhere else in the program code.
\end{itemize}
The bottom line is: one should use regular expression literals whenever possible.

\section{Layout} \label{layout}

\frege{} permits the omission of the braces and semicolons by using \emph{layout} to convey the same information.
This allows both layout-sensitive and layout-insensitive styles of coding, which can be freely mixed within one program.
\footnote{Though, experience with \haskell{} seems to show that literally all source code is written using layout.}
Because layout is not required, \frege{} programs can be straightforwardly produced by other programs.

Informally stated, the braces and semicolons are inserted as follows.
The layout (or "offside") rule takes effect whenever the open brace is omitted after the keyword \term{where}, \term{let}, \term{do}, or \term{of}.
When this happens, the indentation of the next lexeme (whether or not on a new line) is remembered and the omitted open brace is inserted (the whitespace preceding the lexeme may include comments).
For each subsequent line, if it contains only whitespace or is indented more, then the previous item is continued (nothing is inserted);
if it is indented the same amount, then a new item begins (a semicolon is inserted);
and if it is indented less, then the layout list ends (a close brace is inserted).

The layout rule matches only those open braces that it has inserted; an explicit open brace must be matched by an explicit close brace.
Within these explicit open braces, no layout processing is performed for constructs outside the braces, even if a line is indented to the left of an earlier implicit open brace.
%See \autoref{layoutrules} for a more precise definition of the layout rules.

\hasdiff{
The token following a \term{where}, \term{let}, \term{do} or \term{of} keyword must either be an opening brace or it must be more indented than  the current level, otherwise the layout algorithm will insert a closing brace which will result in a syntax error. In other words, the token sequence $\lbrace{}\rbrace{}$ will never be inserted.

The layout handling is a purely lexical matter, hence it is not possible to insert a closing brace \emph{"if an illegal lexeme
is encountered at a point where a closing brace would be legal"}.

However, a closing brace is inserted before the keyword \term{in} regardless of the indentation unless it is already preceded by a closing brace. Hence, one can still write let expressions on a single line.
}